Abstract

Locality-sensitive hashing (LSH) is a popular data-independent indexing method for approximate similarity search, where random projections followed by quantization hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. Most high-dimensional visual descriptors for images exhibit a natural matrix structure. When visual descriptors are represented by high-dimensional feature vectors and long binary codes are assigned, a random projection matrix incurs high space and time complexity. In this paper we analyze a bilinear random projection method where feature matrices are transformed to binary codes by two smaller random projection matrices. We base our theoretical analysis on extending Raginsky and Lazebnik's result, where random Fourier features are composed with random binary quantizers to form locality-sensitive binary codes. To this end, we answer the following two questions: (1) whether a bilinear random projection also yields similarity-preserving binary codes; (2) whether a bilinear random projection yields performance gain or loss, compared to a large linear projection. Regarding the first question, we present upper and lower bounds on the expected Hamming distance between binary codes produced by bilinear random projections. Regarding the second question, we analyze the upper and lower bounds on the covariance between two bits of binary codes, showing that the correlation between two bits is small. Numerical experiments on the MNIST and Flickr45K datasets confirm the validity of our method.


1. Introduction 

Nearest neighbor search, the goal of which is to find the items most relevant to a query under a pre-defined distance metric, is a core problem in various applications such as classification, object matching, and retrieval. A naive solution to nearest neighbor search is linear scan, where all items in the database are sorted according to their similarity to the query in order to find relevant items, requiring linear complexity. In practical applications, however, linear scan is not scalable due to the size of the database. Approximate nearest neighbor search, which trades accuracy for scalability, has become more important than ever. Earlier work took a tree-based approach that exploits spatial partitions of the data space via various tree structures to speed up search. While tree-based methods are successful for low-dimensional data, their performance is not satisfactory for high-dimensional data and they do not guarantee faster search than linear scan.

Figure 1. Two exemplary visual descriptors, which have a natural matrix structure, are often converted to long vectors. The top illustration describes LLC, where the spatial information of initial descriptors is summarized into a final concatenated feature with a spatial pyramid structure. The bottom shows VLAD, where the residual between initial descriptors and their nearest visual vocabulary (marked as a triangle) is encoded in matrix form.

For high-dimensional data, a promising approach is approximate similarity search via hashing. Locality-sensitive hashing (LSH) is a notable data-independent hashing method, which randomly generates binary codes such that two similar items in the database are hashed to collide with high probability. Different similarity metrics lead to different LSH schemes, including angle preservation, the $\ell_p$ norm ($p \in (0, 2]$), and shift-invariant kernels. Since LSH is a purely data-independent approach, it needs multiple hash tables or long codes, resulting in a high memory footprint. To remedy the high memory consumption, data-dependent hashing has been introduced to learn similarity-preserving binary codes from data, such that the embedding into a binary space preserves the similarity between data points in the original space. In general, data-dependent hashing generates more compact binary codes than LSH. However, LSH still works well compared to data-dependent hashing methods when the code size is very large.

Most existing hashing algorithms do not take into account the natural matrix structure frequently observed in image descriptors, as shown in Fig. 1. When a matrix descriptor is re-organized as a high-dimensional vector, most hashing methods suffer from high storage and time complexity due to a single large projection matrix. Given a $d$-dimensional vector, the space and time complexities to generate a code of size $k$ are both $O(dk)$. In the case of 100,000-dimensional data, 40GB* is required to store a projection matrix that generates a binary code of length 100,000, which is not desirable when constructing a large-scale vision system.

Bilinear projection, which consists of left and right projections, is a promising approach to handling data with a matrix structure. It has been successfully applied to two-dimensional principal component analysis (2D-PCA) [18] and 2D canonical correlation analysis (2D-CCA) [8], demonstrating that the time and space complexities are reduced while retaining performance, compared to a single large projection method. Recently, bilinear projections were adapted to the angle-preserving LSH [6], where the space and time complexities are $O(\sqrt{dk})$ and $O(d\sqrt{k})$, respectively, to generate binary codes of size $k$ for $\sqrt{d}$ by $\sqrt{d}$ matrix data. Note that when such matrix data is re-organized as a $d$-dimensional vector, the space and time complexities for LSH are both $O(dk)$. While promising results for hashing with bilinear projections are reported in [6], a theoretical analysis is not yet available.
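As a quick sanity check of the storage figures quoted above, the following sketch compares the memory needed by one dense projection matrix with that of two smaller bilinear factors. The helper names and the square reshaping are assumptions made for illustration only; they are not part of the original analysis.

```python
import math

BYTES_PER_FLOAT32 = 4  # single precision, as assumed in the 40GB estimate


def linear_projection_bytes(d: int, k: int) -> int:
    """Memory for one dense d-by-k projection matrix."""
    return d * k * BYTES_PER_FLOAT32


def bilinear_projection_bytes(d_w: int, d_v: int, k_w: int, k_v: int) -> int:
    """Memory for a d_w-by-k_w left factor plus a d_v-by-k_v right factor."""
    return (d_w * k_w + d_v * k_v) * BYTES_PER_FLOAT32


if __name__ == "__main__":
    d = k = 100_000
    print(linear_projection_bytes(d, k) / 1e9, "GB")  # 40.0 GB
    # Hypothetical square reshaping: sqrt(d) is not an integer here,
    # so the side length is rounded down for the purpose of the comparison.
    side = math.isqrt(d)
    print(bilinear_projection_bytes(side, side, side, side) / 1e6, "MB")
```

The point of the comparison is only the order of magnitude: the dense matrix grows with $dk$, while the two factors grow with $\sqrt{dk}$.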

In this paper we present a bilinear extension of LSH from shift-invariant kernels (LSH-SIK) [11] and attempt to answer the following two questions, namely whether:

• randomized bilinear projections also yield similarity-preserving binary codes;

• there is a performance gain or loss when randomized bilinear projections are adopted instead of a large single linear projection.

Our analysis shows that LSH-SIK with bilinear projections generates similarity-preserving binary codes and that the performance is not much degraded compared to LSH-SIK with a large single linear projection.

*If we use single precision to represent a floating-point number, the projection matrix needs 100,000 × 100,000 × 4 bytes ≈ 40GB.


2. Related Work

In this section we briefly review LSH algorithms for preserving angles or shift-invariant kernels [11]. We also review an existing bilinear hashing method [6].

2.1. LSH: Angle Preservation

Given a vector $x \in \mathbb{R}^d$, a hash function $h_a(x)$ returns a value 1 or 0, i.e., $h_a(\cdot) : \mathbb{R}^d \mapsto \{0, 1\}$. We assume that data vectors are centered, i.e., $\sum_{i=1}^{N} x_i = 0$, where $N$ is the number of samples. The random hyperplane-based hash function involves a random projection followed by a binary quantization, taking the form:

$$h_a(x) \triangleq \frac{1}{2}\left\{1 + \mathrm{sgn}(w^\top x)\right\}, \quad (1)$$

where $w \in \mathbb{R}^d$ is a random vector sampled on the unit $d$-sphere and $\mathrm{sgn}(\cdot)$ is the sign function, which returns 1 whenever the input is nonnegative and $-1$ otherwise. It has been shown that the random hyperplane method naturally gives a family of hash functions for vectors in $\mathbb{R}^d$ such that

$$\mathbb{P}\left[h_a(x) = h_a(y)\right] = 1 - \frac{\theta_{x,y}}{\pi}, \quad (2)$$

where $\theta_{x,y}$ denotes the angle between two vectors $x$ and $y$. This technique, referred to as LSH-angle, works well for preserving an angle, but it does not preserve other types of similarities defined by a kernel between two vectors.

2.2. LSH: Shift-Invariant Kernels

Locality-sensitive hashing from shift-invariant kernels [11], referred to as LSH-SIK, is a random projection-based encoding scheme such that the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel between the two vectors. The random Fourier feature (RFF) is defined by

$$\phi_w(x) \triangleq \sqrt{2}\cos(w^\top x + b), \quad (3)$$

where $w$ is drawn from a distribution corresponding to an underlying shift-invariant kernel, and $b$ is drawn from a uniform distribution over $[0, 2\pi]$, i.e., $b \sim \mathrm{Unif}[0, 2\pi]$. If the kernel $\kappa$ is properly scaled, Bochner's theorem guarantees $\mathbb{E}_{w,b}\left[\phi_w(x)\phi_w(y)\right] = \kappa(x - y)$, where $\mathbb{E}$ is the statistical expectation and $\kappa(\cdot)$ represents a shift-invariant kernel.

LSH-SIK builds a hash function, $h(\cdot) : \mathbb{R}^d \mapsto \{0, 1\}$, composing RFFs with a random binary quantization

$$h(x) \triangleq \frac{1}{2}\left\{1 + \mathrm{sgn}\left(\cos(w^\top x + b) + t\right)\right\}, \quad (4)$$

where $t \sim \mathrm{Unif}[-1, 1]$. The most appealing property of LSH-SIK is that it provides upper and lower bounds on the expected Hamming distance between any two embedded points, which are summarized in Theorem 1.
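The two single-projection hash functions reviewed in this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the Gaussian kernel with unit scale is assumed for LSH-SIK, and a Gaussian direction is used for LSH-angle (only the direction of $w$ affects the sign, so this is equivalent to sampling on the unit sphere).

```python
import numpy as np


def lsh_angle_bit(x, w):
    """Random-hyperplane bit: 1 if w^T x is nonnegative, else 0."""
    return int(w @ x >= 0.0)


def lsh_sik_bit(x, w, b, t):
    """LSH-SIK bit: quantize the random Fourier feature cos(w^T x + b) with threshold t."""
    return int(np.cos(w @ x + b) + t >= 0.0)


def sample_sik_params(d, rng):
    """Draw (w, b, t); w ~ N(0, I) corresponds to a unit-scale Gaussian kernel (assumption)."""
    w = rng.standard_normal(d)
    b = rng.uniform(0.0, 2.0 * np.pi)
    t = rng.uniform(-1.0, 1.0)
    return w, b, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, trials = 16, 20000
    x = rng.standard_normal(d)
    y = x + 0.3 * rng.standard_normal(d)
    theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    agree = differ = 0
    for _ in range(trials):
        w, b, t = sample_sik_params(d, rng)
        agree += lsh_angle_bit(x, w) == lsh_angle_bit(y, w)
        differ += lsh_sik_bit(x, w, b, t) != lsh_sik_bit(y, w, b, t)
    print("LSH-angle collision rate:", agree / trials, "vs 1 - theta/pi:", 1 - theta / np.pi)
    print("LSH-SIK normalized Hamming distance:", differ / trials)
    print("Gaussian kernel value:", np.exp(-0.5 * np.sum((x - y) ** 2)))
```

Running the script gives an empirical collision rate that can be compared against $1 - \theta_{x,y}/\pi$, and an empirical Hamming rate that can be compared against the kernel value.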






Theorem 1. [11] Define the functions

$$g_1(\zeta) \triangleq \frac{4}{\pi^2}(1 - \zeta),$$

$$g_2(\zeta) \triangleq \min\left\{\frac{1}{2}\sqrt{1 - \zeta},\ \frac{4}{\pi^2}\left(1 - \frac{2}{3}\zeta\right)\right\},$$

where $\zeta \in [0, 1]$, and $g_1(0) = g_2(0) = \frac{4}{\pi^2}$, $g_1(1) = g_2(1) = 0$. Suppose that a Mercer kernel $\kappa$ is shift-invariant, normalized, and satisfies $\kappa(\alpha x - \alpha y) \le \kappa(x - y)$ for any $\alpha \ge 1$. Then the expected Hamming distance between any two embedded points satisfies

$$g_1(\kappa(x - y)) \le \mathbb{E}\left[\mathbb{1}[h(x) \ne h(y)]\right] \le g_2(\kappa(x - y)), \quad (5)$$

where $\mathbb{1}[\cdot]$ is the indicator function, which equals 1 if its argument is true and 0 otherwise.

The bounds in Theorem 1 indicate that binary codes determined by LSH-SIK preserve well the similarity defined by the underlying shift-invariant kernel.

2.3. Hashing with Bilinear Projections

Most high-dimensional image descriptors, including HOG, Fisher Vector (FV), and VLAD, exhibit a natural matrix structure. Suppose that $X \in \mathbb{R}^{d_w \times d_v}$ is a descriptor matrix. The matrix is reorganized into a vector $x = \mathrm{vec}(X) \in \mathbb{R}^{d}$, where $d = d_w d_v$, and then a binary code of size $k$ is determined by $k$ independent uses of the hash function. This scheme requires $O(dk)$ in space and time.

A bilinear projection-based method constructs a hash function $H_a(\cdot) : \mathbb{R}^{d_w \times d_v} \mapsto \{0, 1\}^{k_w k_v}$ that is of the form

$$H_a(X) \triangleq \frac{1}{2}\left\{1 + \mathrm{sgn}\left(\mathrm{vec}(W^\top X V)\right)\right\}, \quad (6)$$

where $W \in \mathbb{R}^{d_w \times k_w}$ and $V \in \mathbb{R}^{d_v \times k_v}$, to produce a binary code of size $k = k_w k_v$. This scheme reduces the space and time complexities to $O(d_w k_w + d_v k_v)$ and $O(d_w^2 k_w + d_v^2 k_v)$, respectively, while a single large linear projection requires $O(d_w d_v k_w k_v)$ in space and time.^2 Empirical results in [6] indicate that a random bilinear projection produces performance comparable to a single large projection. However, its theoretical behavior has not been fully investigated. In the next section, we consider a bilinear extension of LSH-SIK [11] and present our theoretical analysis.


3. Analysis of Bilinear Random Projections

In this section we present the main contribution, which is a theoretical analysis of a bilinear extension of LSH-SIK. To this end, we consider a hash function $h(\cdot) : \mathbb{R}^{d_w \times d_v} \mapsto \{0, 1\}$ that is of the form

$$h(X) \triangleq \frac{1}{2}\left\{1 + \mathrm{sgn}\left(\cos(w^\top X v + b) + t\right)\right\}, \quad (7)$$

where $w, v \sim \mathcal{N}(0, I)$, $b \sim \mathrm{Unif}[0, 2\pi]$, and $t \sim \mathrm{Unif}[-1, 1]$. With a slight abuse of notation, we also use $h(\cdot)$ for randomized bilinear hashing; it can be distinguished from (4) by its input argument, $x$ or $X$. To produce a binary code of size $k = k_w k_v$, the hash function $H(\cdot) : \mathbb{R}^{d_w \times d_v} \mapsto \{0, 1\}^{k_w k_v}$ takes the form:

$$H(X) \triangleq \frac{1}{2}\left\{1 + \mathrm{sgn}\left(\cos(\mathrm{vec}(W^\top X V) + b) + t\right)\right\}, \quad (8)$$

where each column of $W$ or of $V$ is independently drawn from a spherical Gaussian with zero mean and unit variance, and each entry of $b \in \mathbb{R}^{k}$ or of $t \in \mathbb{R}^{k}$ is drawn uniformly from $[0, 2\pi]$ or $[-1, 1]$, respectively.

We attempt to answer two questions on whether: (1) bilinear random projections also yield similarity-preserving binary codes like the original LSH-SIK; (2) there is performance gain or degradation when bilinear random projections are adopted instead of a large linear projection.

To answer the first question, we compute the upper and lower bounds on the expected Hamming distance $\mathbb{E}\left[\mathbb{1}[h(X) \ne h(Y)]\right]$ between any two embedded points computed by bilinear LSH-SIK with a Gaussian kernel. Compared to the original upper and lower bounds for LSH-SIK [11] with a single linear projection (Theorem 1), our upper bound is the same and our lower bound is slightly worse when the underlying kernel is Gaussian.

Regarding the second question, note that some bits of the binary codes computed by the hash function (8) share either a left or a right projection (a column vector of $W$ or $V$), leading to correlations between two bits. We show that the covariance between two bits is not high by analyzing the upper and lower bounds on the covariance between the two bits.
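The multi-bit bilinear hash function can be sketched as follows: project the matrix from both sides, vectorize, add the random phases, and quantize against the random thresholds. This is an illustrative NumPy sketch under the sampling assumptions stated in this section; the function names are hypothetical, and `vec` is taken to be column-major stacking.

```python
import numpy as np


def sample_blsh_params(d_w, d_v, k_w, k_v, rng):
    """Draw W (d_w x k_w), V (d_v x k_v), phases b and thresholds t (length k_w * k_v)."""
    W = rng.standard_normal((d_w, k_w))
    V = rng.standard_normal((d_v, k_v))
    b = rng.uniform(0.0, 2.0 * np.pi, size=k_w * k_v)
    t = rng.uniform(-1.0, 1.0, size=k_w * k_v)
    return W, V, b, t


def blsh_sik_code(X, W, V, b, t):
    """Return the k_w * k_v bits 0.5 * (1 + sgn(cos(vec(W^T X V) + b) + t))."""
    projected = (W.T @ X @ V).ravel(order="F")  # column-major vec(.)
    return (np.cos(projected + b) + t >= 0.0).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, V, b, t = sample_blsh_params(28, 28, 20, 20, rng)  # 400-bit code
    X = rng.standard_normal((28, 28))
    print(blsh_sik_code(X, W, V, b, t)[:16])
```

Only the two factors `W` and `V` are stored, and the bit at position `i + j * k_w` depends on column `i` of `W` and column `j` of `V`; bits that share a column are the ones whose correlation is analyzed later.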


3.1. Random Fourier Features 


We begin by investigating the properties of random Fourier features [11] in the case of bilinear projections, since BLSH-SIK (an abbreviation of bilinear LSH-SIK) bases its theoretical analysis on these properties. To this end, we consider the bilinear RFF:

$$\phi_{w,v}(X) \triangleq \sqrt{2}\cos(w^\top X v + b), \quad (9)$$

where $w, v \sim \mathcal{N}(0, I)$ and $b \sim \mathrm{Unif}[0, 2\pi]$.

In the case of a randomized linear map where $w \sim \mathcal{N}(0, I)$, $\mathbb{E}\left[\phi_w(x)\phi_w(y)\right] = \kappa_g(x - y)$, where $\kappa_g(\cdot)$ is the Gaussian kernel. Unfortunately, for the randomized bilinear map, $\mathbb{E}\left[\phi_{w,v}(X)\phi_{w,v}(Y)\right] \ne \kappa_g(\mathrm{vec}(X - Y))$, where the Gaussian kernel is defined as

$$\kappa_g(\mathrm{vec}(X - Y)) \triangleq \exp\left\{-\frac{1}{2}\|\mathrm{vec}(X - Y)\|_2^2\right\} = \exp\left\{-\frac{1}{2}\mathrm{tr}\left[(X - Y)(X - Y)^\top\right]\right\},$$

where $X, Y \in \mathbb{R}^{d_w \times d_v}$ and the scaling parameter of the Gaussian kernel is set to 1. However, we show that $\mathbb{E}\left[\phi_{w,v}(X)\phi_{w,v}(Y)\right]$ lies between $\kappa_g(\mathrm{vec}(X - Y))$ and $\kappa_g(\mathrm{vec}(X - Y))^{0.79}$, which is summarized in the following lemma.

Lemma 1. Define $\Delta = (X - Y)(X - Y)^\top$. Denote by $\{\lambda_j\}$ the eigenvalues of $\Delta$, with $\lambda_1$ the leading one. The inner product between RFFs is given by

$$\mathbb{E}_{w,v,b}\left[\phi_{w,v}(X)\phi_{w,v}(Y)\right] = \prod_j (1 + \lambda_j)^{-\frac{1}{2}} \triangleq \kappa_b(X - Y). \quad (10)$$

Then, $\kappa_b(X - Y)$ is upper and lower bounded in terms of the Gaussian kernel $\kappa_g(\mathrm{vec}(X - Y))$:

$$\kappa_g(\mathrm{vec}(X - Y)) \le \kappa_b(X - Y) \le \kappa_g(\mathrm{vec}(X - Y))^{0.79},$$

provided that the following assumptions are satisfied:

• $\|X\|_F \le 0.8$, which can be easily satisfied by re-scaling the data.

• $\lambda_1 \le 0.28 \sum_{i=2}^{d_w} \lambda_i$, which can be easily satisfied for large $d_w$.

Proof.

$$\mathbb{E}_{w,v,b}\left[\phi_{w,v}(X)\phi_{w,v}(Y)\right] = \int\!\!\int \cos\left(w^\top (X - Y) v\right) p(w)\, p(v)\, dw\, dv$$

$$= \int \kappa_g\left((X - Y)^\top w\right) p(w)\, dw$$

$$= (2\pi)^{-\frac{d_w}{2}} \int \exp\left\{-\frac{1}{2} w^\top (I + \Delta) w\right\} dw$$

$$= |I + \Delta|^{-\frac{1}{2}},$$

where $|\cdot|$ denotes the determinant of a matrix. The eigen-decomposition of $\Delta$ is given by $\Delta = U \Lambda U^\top$, where $U$ and $\Lambda$ are the eigenvector and eigenvalue matrices, respectively. Then we have

$$|I + \Delta|^{-\frac{1}{2}} = |U(I + \Lambda)U^\top|^{-\frac{1}{2}} = \prod_j (1 + \lambda_j)^{-\frac{1}{2}}.$$

Now we prove the following inequalities:

$$\kappa_g(\mathrm{vec}(X - Y)) \le \kappa_b(X - Y) \le \kappa_g(\mathrm{vec}(X - Y))^{0.79}.$$

Lower bound: First, we can easily show the lower bound on $\kappa_b(X - Y)$ with the following inequality: $\kappa_g(\mathrm{vec}(X - Y)) = \prod_j \exp(\lambda_j)^{-\frac{1}{2}} \le \prod_j (1 + \lambda_j)^{-\frac{1}{2}} = \kappa_b(X - Y)$, because $1 + \lambda_j \le \exp(\lambda_j)$.

Upper bound: Second, assuming that $\|X\|_F \le 0.8$, we can derive the upper bound on $\kappa_b(X - Y)$. We first bound $\lambda_j$ as follows:

$$\mathrm{tr}\left[(X - Y)(X - Y)^\top\right] \le (2 \cdot 0.8)^2 = 2.56 \ \Rightarrow\ \sum_k \lambda_k \le 2.56 \ \Rightarrow\ \lambda_1 \le 0.56 \quad \left(\because\ \lambda_k \ge 0,\ \lambda_1 \le 0.28 \sum_{i=2}^{d_w} \lambda_i\right).$$

Since $\lambda_1$ is the leading eigenvalue, every $\lambda_j$ satisfies $0 \le \lambda_j \le 0.56$. For $0 \le \lambda_j \le 0.56$, we know that $\exp(\lambda_j)^{0.79} \le 1 + \lambda_j$, leading to the upper bound on $\kappa_b(X - Y)$, i.e., $\kappa_b(X - Y) \le \kappa_g(\mathrm{vec}(X - Y))^{0.79}$. □

Lemma 1 indicates that random Fourier features with bilinear projections are related to those with a single projection in the case of the Gaussian kernel. Due to this relation, we can conclude in the following section that random Fourier features with bilinear projections can generate similarity-preserving binary codes. Finally, we summarize some important properties of $\kappa_b(X - Y)$, showing that $\kappa_b(X - Y)$ shares similar properties with $\kappa_g(X - Y)$:

• Property 1: $0 \le \kappa_b(X - Y) \le 1$.

• Property 2: $\kappa_b(mX - mY) \le \kappa_b(X - Y)$, where $m$ is a positive integer.

^2 Recently, a circulant embedding, which is implemented by the discrete Fourier transform, was proposed to reduce the space and time complexities to $O(d)$ and $O(d \log d)$ when $d = d_w \times d_v$ and the code length is $d$. Even though the circulant embedding is faster than bilinear projections, we believe that it is worth analyzing hashing with bilinear projections, because the implementation is simpler.

Fig. 2 demonstrates that the inner product of two data points induced by bilinear RFF is upper and lower bounded with respect to the Gaussian kernel, as shown in Lemma 1. For high $d_w$, the upper bound is satisfied in Fig. 2(c)-(d), which is consistent with our intuition.

For Fig. 2, we generate the data from a uniform distribution with different dimensions, and re-scale the data so that $\|X\|_F = 0.8$. To compute the estimates of bilinear RFF, we independently generate 10,000 triples $\{w_i, v_i, b_i\}$ and calculate the following sample average: $\frac{1}{k}\sum_{i=1}^{k}\left[\phi_{w_i,v_i}(X)\phi_{w_i,v_i}(Y)\right]$, where $k = 10{,}000$. For the estimates of RFF, we calculate the sample average with 10,000 independently generated pairs $\{w_i, b_i\}$.
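The sandwich relation of Lemma 1 is easy to check numerically. The sketch below computes the Gaussian kernel value, the closed form $\prod_j (1 + \lambda_j)^{-1/2}$ from the eigenvalues of $(X - Y)(X - Y)^\top$, and a Monte Carlo estimate of the bilinear RFF inner product. It is an illustration only: the function names are hypothetical, and the uniform data with Frobenius norm 0.8 loosely follows the setup described for Fig. 2.

```python
import numpy as np


def gaussian_kernel(X, Y):
    """kappa_g(vec(X - Y)) = exp(-0.5 * ||X - Y||_F^2), unit scale."""
    return float(np.exp(-0.5 * np.sum((X - Y) ** 2)))


def bilinear_kernel(X, Y):
    """kappa_b(X - Y) = prod_j (1 + lambda_j)^(-1/2) for eigenvalues of (X-Y)(X-Y)^T."""
    D = X - Y
    eigvals = np.clip(np.linalg.eigvalsh(D @ D.T), 0.0, None)  # remove round-off negatives
    return float(np.prod(1.0 + eigvals) ** -0.5)


def bilinear_rff_estimate(X, Y, n_samples, rng):
    """Sample average of phi_{w,v}(X) * phi_{w,v}(Y) over random triples (w, v, b)."""
    d_w, d_v = X.shape
    W = rng.standard_normal((n_samples, d_w))
    V = rng.standard_normal((n_samples, d_v))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)
    px = np.einsum("nd,de,ne->n", W, X, V)  # w_i^T X v_i for every sample i
    py = np.einsum("nd,de,ne->n", W, Y, V)
    return float(np.mean(2.0 * np.cos(px + b) * np.cos(py + b)))


def rescale_to_frobenius(X, radius=0.8):
    """Rescale X so that its Frobenius norm equals `radius`."""
    return X * (radius / np.linalg.norm(X))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rescale_to_frobenius(rng.uniform(size=(28, 28)))
    Y = rescale_to_frobenius(rng.uniform(size=(28, 28)))
    kg, kb = gaussian_kernel(X, Y), bilinear_kernel(X, Y)
    print("kappa_g      :", kg)
    print("kappa_b      :", kb)
    print("kappa_g^0.79 :", kg ** 0.79)
    print("Monte Carlo  :", bilinear_rff_estimate(X, Y, 10_000, rng))
```

The lower bound holds for any pair of matrices, whereas the upper bound relies on the eigenvalue assumptions of the lemma.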

3.2. Bounds on Expected Hamming Distance 

In this section, we derive the upper and lower bounds on the expected Hamming distance between binary codes computed by BLSH-SIK to show that BLSH-SIK can generate similarity-preserving binary codes in the sense of the Gaussian kernel. Lemma 2 is a slight modification of the expected Hamming distance by LSH-SIK with a single projection [11], indicating that the expected Hamming distance is analytically represented.

Lemma 2.

$$\mathbb{E}_{w,v,b,t}\left[\mathbb{1}[h(X) \ne h(Y)]\right] = \frac{8}{\pi^2}\sum_{m=1}^{\infty}\frac{1 - \kappa_b(mX - mY)}{4m^2 - 1}.$$

Proof. This is a slight modification of the result (for a randomized linear map) in [11]. Since the proof is straightforward, it is placed in the supplementary material. □

Though the expected Hamming distance is analytically represented with respect to $\kappa_b$, its relationship with $\kappa_g$ is not fully exploited. In order to show the similarity-preserving property of BLSH-SIK more clearly, Theorem 2 gives the upper and lower bounds on the expected Hamming distance for BLSH-SIK in terms of $\kappa_g(\mathrm{vec}(X - Y))$.

Theorem 2. Define the functions

$$g_1(\zeta) \triangleq \frac{4}{\pi^2}\left(1 - \zeta^{0.79}\right),$$

$$g_2(\zeta) \triangleq \min\left\{\frac{1}{2}\sqrt{1 - \zeta},\ \frac{4}{\pi^2}\left(1 - \frac{2}{3}\zeta\right)\right\},$$

where $\zeta \in [0, 1]$ and $g_1(0) = g_2(0) = \frac{4}{\pi^2}$, $g_1(1) = g_2(1) = 0$. The Gaussian kernel $\kappa_g$ is shift-invariant, normalized, and satisfies $\kappa_g(\alpha x - \alpha y) \le \kappa_g(x - y)$ for any $\alpha \ge 1$. Then the expected Hamming distance between any two embedded points computed by bilinear LSH-SIK satisfies

$$g_1(\kappa_g(\tau)) \le \mathbb{E}\left[\mathbb{1}[h(X) \ne h(Y)]\right] \le g_2(\kappa_g(\tau)), \quad (11)$$

where $\tau = \mathrm{vec}(X - Y)$.

Proof. We prove the upper and lower bounds one at a time, following the technique used in [11]. Note that the lower bound $g_1(\zeta)$ is slightly different from the one in Theorem 1, whereas the upper bound $g_2(\zeta)$ is the same as the one in Theorem 1.

Lower bound: Property 2 gives $\kappa_b(mX - mY) \le \kappa_b(X - Y)$ for every $m$, and $\sum_{m=1}^{\infty} \frac{1}{4m^2 - 1} = \frac{1}{2}$. It then follows from Lemma 2 and the upper bound in Lemma 1 that

$$\mathbb{E}\left[\mathbb{1}[h(X) \ne h(Y)]\right] \ge \frac{4}{\pi^2}\left(1 - \kappa_b(X - Y)\right) \ge \frac{4}{\pi^2}\left(1 - \kappa_g(\mathrm{vec}(X - Y))^{0.79}\right) = g_1(\kappa_g(\mathrm{vec}(X - Y))).$$

Upper bound: By the proof of Lemma 2.3 in [11], we can easily find the upper bound as

$$\mathbb{E}\left[\mathbb{1}[h(X) \ne h(Y)]\right] \le \min\left\{\frac{1}{2}\sqrt{1 - \kappa_b(X - Y)},\ \frac{4}{\pi^2}\left(1 - \frac{2}{3}\kappa_b(X - Y)\right)\right\}.$$

Moreover, the inequality $\kappa_g(\mathrm{vec}(X - Y)) \le \kappa_b(X - Y)$ in Lemma 1 yields

$$\mathbb{E}\left[\mathbb{1}[h(X) \ne h(Y)]\right] \le \min\left\{\frac{1}{2}\sqrt{1 - \kappa_g(\tau)},\ \frac{4}{\pi^2}\left(1 - \frac{2}{3}\kappa_g(\tau)\right)\right\} = g_2(\kappa_g(\tau)).$$

□

Figure 2. Estimates of bilinear RFF ($\kappa_b(\cdot)$) and its lower/upper bounds ($\kappa_g(\cdot)$ and $\kappa_g(\cdot)^{0.79}$) with respect to Gaussian kernel values. Red marks represent the inner products of two data points induced by bilinear RFF, and blue (black) marks represent its lower (upper) bounds.

Figure 3. Upper and lower bounds on the expected Hamming distance between binary codes computed by BLSH-SIK and LSH-SIK.

Theorem 2 shows that bilinear projections can generate similarity-preserving binary codes, where the expected Hamming distance is upper and lower bounded in terms of $\kappa_g(\mathrm{vec}(X - Y))$. Compared with the original upper and lower bounds in the case of a single projection, shown in Lemma 2.3 of [11], we derive the same upper bound and a slightly worse lower bound, as depicted in Fig. 3.
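To make the bounds concrete, the sketch below evaluates the lower and upper bound functions and a truncated version of the series in Lemma 2. It is a hedged illustration: the function names are hypothetical, the lower bound uses the exponent 0.79 from Lemma 1, and the second argument of `g2`, $(4/\pi^2)(1 - 2\zeta/3)$, is an assumption taken from the single-projection bound of Raginsky and Lazebnik and should be checked against that reference.

```python
import math

FOUR_OVER_PI_SQ = 4.0 / math.pi ** 2


def g1_linear(zeta):
    """Single-projection lower bound: (4 / pi^2) * (1 - zeta)."""
    return FOUR_OVER_PI_SQ * (1.0 - zeta)


def g1_bilinear(zeta):
    """Bilinear lower bound: (4 / pi^2) * (1 - zeta^0.79)."""
    return FOUR_OVER_PI_SQ * (1.0 - zeta ** 0.79)


def g2(zeta):
    """Upper bound (assumed form): min{0.5 * sqrt(1 - zeta), (4 / pi^2) * (1 - 2 zeta / 3)}."""
    return min(0.5 * math.sqrt(1.0 - zeta), FOUR_OVER_PI_SQ * (1.0 - 2.0 * zeta / 3.0))


def expected_hamming(kappa_b_of_m, terms=2000):
    """Truncated series (8 / pi^2) * sum_m (1 - kappa_b(m(X - Y))) / (4 m^2 - 1)."""
    total = sum((1.0 - kappa_b_of_m(m)) / (4.0 * m * m - 1.0) for m in range(1, terms + 1))
    return 8.0 / math.pi ** 2 * total


if __name__ == "__main__":
    # Hypothetical example: (X - Y)(X - Y)^T has a single nonzero eigenvalue lam,
    # so kappa_b(m(X - Y)) = (1 + m^2 * lam)^(-1/2) and kappa_g = exp(-lam / 2).
    lam = 0.5
    kappa_g = math.exp(-0.5 * lam)
    value = expected_hamming(lambda m: (1.0 + m * m * lam) ** -0.5)
    print("lower bound:", g1_bilinear(kappa_g))
    print("series     :", value)
    print("upper bound:", g2(kappa_g))
```

Because $\zeta^{0.79} \ge \zeta$ on $[0, 1]$, `g1_bilinear` never exceeds `g1_linear`, which is the sense in which the bilinear lower bound is slightly worse.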

3.3. Bounds on Covariance

In this section, we analyze the covariance between two bits induced by BLSH-SIK to address how much the performance would drop compared with a single large projection matrix.

A hash function for multiple bits using bilinear projections (8) implies that there exist bits which share one of the projection vectors. For example, assume that $h_i(\cdot)$ is given as

$$h_i(X) = \mathrm{sgn}\left(\cos(w_1^\top X v_i + b_i) + t_i\right). \quad (12)$$

We can easily find the following $d_v - 1$ hash functions which share $w_1$ with $h_i(\cdot)$:

$$h_j(X) = \mathrm{sgn}\left(\cos(w_1^\top X v_j + b_j) + t_j\right), \quad (13)$$

where $j \in \{1, \cdots, d_v\} \setminus \{i\}$.

If two bits do not share any of the projection vectors, the bits should be independent, which indicates zero correlation. This raises a natural question: how strongly are two bits that share one of the projection vectors correlated? Intuitively, we expect that highly correlated bits are not favorable, because such bits contain redundant information for approximating $\mathbb{E}\left[\mathbb{1}[h(X) \ne h(Y)]\right]$. Theorem 3 shows that the upper bound on the covariance between two bits induced by bilinear projections is small, which explains why BLSH-SIK performs well enough with a large number of bits.

Theorem 3. Given the hash functions as in Eqs. (12)-(13), the upper bound on the covariance between the two bits is derived as

$$\mathrm{cov}(\cdot) \le \frac{64}{\pi^4}\left[\left(\sum_{m=1}^{\infty}\frac{\kappa_g(\mathrm{vec}(X - Y))^{0.79 m^2}}{4m^2 - 1}\right)^2 - \left(\sum_{m=1}^{\infty}\frac{\kappa_g(\mathrm{vec}(X - Y))^{m^2}}{4m^2 - 1}\right)^2\right],$$

where $\kappa_g(\cdot)$ is the Gaussian kernel and $\mathrm{cov}(\cdot)$ is the covariance between two bits defined as

$$\mathrm{cov}(\cdot) = \mathbb{E}\left[\mathbb{1}[h_i(X) \ne h_i(Y)]\,\mathbb{1}[h_j(X) \ne h_j(Y)]\right] - \mathbb{E}\left[\mathbb{1}[h_i(X) \ne h_i(Y)]\right]\mathbb{E}\left[\mathbb{1}[h_j(X) \ne h_j(Y)]\right].$$

Proof. Since the proof is lengthy and tedious, the detailed proof and the lower bound on the covariance can be found in the supplementary material. □

Figure 4. Upper bound on the covariance between two bits induced by BLSH-SIK. The horizontal axis shows the Gaussian kernel value of two data points. The vertical axis shows the upper bound on the covariance.

Fig. 4 depicts the upper bound on the covariance between two bits induced by BLSH-SIK with respect to the Gaussian kernel value. We can easily see that the covariance between the two bits for highly similar points ($\kappa_g(\mathrm{vec}(X - Y)) \approx 1$) is nearly zero, indicating that there is no correlation between the two bits. Unfortunately, there exists unfavorable correlation for data points which are not highly (dis)similar. To remedy such unfavorable correlation, a simple heuristic is proposed, in which $k \times m^2$ bits are first generated and then $k$ bits are randomly selected, where $k$ is the desired number of bits and $m$ is a free parameter that reduces the unfavorable correlation at the expense of storage and computational costs. This simple heuristic reduces the correlation between the two bits without incurring too much computational and storage cost. Algorithm 1 summarizes BLSH-SIK with the proposed heuristic.


4. Experiments

In this section, we present numerical experimental results to support the analysis presented in the previous sections, validating the practical usefulness of BLSH-SIK. For the numerical experiments, two widely-used datasets, MNIST^3 and Flickr45K^4, are used to investigate the behavior of BLSH-SIK from small- to high-dimensional data. MNIST consists of 70,000 handwritten digit images, each represented by a 28-by-28 matrix, where the raw images are used for the experiments. Flickr45K is constructed by randomly selecting 45,000 images from a collection of 1 million Flickr images. VLAD is used to represent an image with 500 cluster centers, resulting in a 500 × 128 = 64,000 dimensional vector normalized to unit length in the $\ell_2$ norm. For BLSH-SIK, we reshape the vector into a 250-by-256 matrix.

Figure 5. Precision-recall curves for LSH-SIK with a single projection (referred to as LSH-SIK-Single) and BLSH-SIK (referred to as LSH-SIK-Bilinear) on MNIST with respect to the different number of bits: (a) 400 bits, (b) 900 bits, (c) 1,600 bits, (d) 2,500 bits. In the case of BLSH-SIK, the precision-recall curves are plotted for different $m$, which is introduced to reduce the correlation in Algorithm 1.

Algorithm 1 LSH for Shift-invariant Kernels with Bilinear Projections (BLSH-SIK)

Input: A data point $X \in \mathbb{R}^{d_w \times d_v}$, the desired number of bits $k$, the hyper-parameter $m$ to reduce the correlation, and a subset $I$ with $k$ elements of $\{1, 2, \cdots, m^2 \times k\}$.

Output: A binary code of $X$ with $k$ bits.

1: $W \in \mathbb{R}^{d_w \times m\sqrt{k}}$ and $V \in \mathbb{R}^{d_v \times m\sqrt{k}}$ are element-wise drawn from the zero-mean Gaussian, $\mathcal{N}(0, 1)$.

2: $b \in \mathbb{R}^{m^2 k}$ and $t \in \mathbb{R}^{m^2 k}$ are element-wise drawn from the uniform distributions $\mathrm{Unif}[0, 2\pi]$ and $\mathrm{Unif}[-1, 1]$, respectively.

3: Generate a binary code with $k \times m^2$ bits: $\frac{1}{2}\left(1 + \mathrm{sgn}\left(\cos(\mathrm{vec}(W^\top X V) + b) + t\right)\right)$.

4: Select $k$ bits from the binary code using the pre-defined subset $I$.
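The oversample-then-select heuristic of Algorithm 1 can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code: the function names are hypothetical, and it assumes that $k$ is a perfect square so that both projection matrices have $m\sqrt{k}$ columns, which yields $m^2 k$ candidate bits. The parameters and the index subset are drawn once and shared by all data points.

```python
import math

import numpy as np


def make_blsh_sik(d_w, d_v, k, m, rng):
    """Draw shared parameters: W, V with m*sqrt(k) columns, b, t, and the index subset I."""
    side = math.isqrt(k)
    if side * side != k:
        raise ValueError("this sketch assumes k is a perfect square")
    k_side = m * side
    n_bits = k_side * k_side  # equals m^2 * k candidate bits
    W = rng.standard_normal((d_w, k_side))
    V = rng.standard_normal((d_v, k_side))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_bits)
    t = rng.uniform(-1.0, 1.0, size=n_bits)
    subset = rng.choice(n_bits, size=k, replace=False)  # pre-defined subset I
    return W, V, b, t, subset


def encode(X, params):
    """Generate all m^2 * k bits for X, then keep only the k bits indexed by the subset."""
    W, V, b, t, subset = params
    projected = (W.T @ X @ V).ravel(order="F")
    bits = (np.cos(projected + b) + t >= 0.0).astype(np.uint8)
    return bits[subset]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = make_blsh_sik(28, 28, k=400, m=5, rng=rng)
    X = rng.standard_normal((28, 28))
    print(encode(X, params)[:16])
```

With $m = 1$ the subset is a permutation of all bits, so the heuristic reduces to the plain bilinear hash; larger $m$ makes it less likely that two selected bits share a projection vector, at the cost of larger `W` and `V`.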

^3 http://yann.lecun.com/exdb/mnist/

^4 http://lear.inrialpes.fr/people/jegou/data.php

Figure 6. Precision-recall curves for LSH-SIK with a single projection (referred to as LSH-SIK-Single) and BLSH-SIK (referred to as LSH-SIK-Bilinear) on Flickr45K with respect to the different number of bits, where the precision-recall curves for BLSH-SIK are plotted for different values of the hyper-parameter $m$.

The ground-truth neighbors should be carefully constructed in order to compare hashing algorithms fairly. We adopt the same procedure to construct the ground-truth neighbors as presented in [11]. First, we decide an appropriate threshold to judge which neighbors should be ground-truth neighbors, where the average Euclidean distance between a query and its 50th nearest neighbor is set as the threshold. Then, a point is declared a ground-truth neighbor of a query if the distance between them is less than the threshold. Finally, we re-scale the dataset such that the threshold is one, so that the scaling parameter of the Gaussian kernel can be set to one. For both datasets, we randomly select 300 data points as queries, and the queries which have more than 5,000 ground-truth neighbors are excluded. To avoid any biased results, all precision-recall curves in this section are plotted with error bars showing the mean and one standard deviation over 5 repetitions.
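The ground-truth construction described above can be sketched as follows. This is a minimal NumPy illustration on small in-memory arrays; the function name and return values are hypothetical, and details such as whether a query is itself counted among the database points are assumptions, since the text does not specify them.

```python
import numpy as np


def build_ground_truth(queries, database, nn_rank=50, max_neighbors=5000):
    """Threshold-based ground truth.

    The threshold is the mean, over queries, of the Euclidean distance to the
    nn_rank-th nearest database point. Queries with more than max_neighbors
    ground-truth neighbors are dropped.
    """
    # Pairwise Euclidean distances, shape (n_queries, n_database).
    dists = np.linalg.norm(queries[:, None, :] - database[None, :, :], axis=2)
    kth = np.sort(dists, axis=1)[:, nn_rank - 1]
    threshold = float(kth.mean())
    is_neighbor = dists < threshold
    keep = is_neighbor.sum(axis=1) <= max_neighbors
    scale = 1.0 / threshold  # multiplying the data by this makes the threshold equal one
    return is_neighbor[keep], keep, scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    database = rng.standard_normal((2000, 16))
    queries = database[rng.choice(2000, size=30, replace=False)]
    neighbors, keep, scale = build_ground_truth(queries, database)
    print("kept queries:", int(keep.sum()), "scale:", scale)
```

For datasets of the size used in the experiments, the pairwise distance matrix would have to be computed in chunks; the dense broadcast is used here only to keep the sketch short.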

Figs. 5 and 6 show precision-recall curves for LSH-SIK with a single projection and BLSH-SIK on MNIST and Flickr45K with respect to the different number of bits. In the case of BLSH-SIK, the precision-recall curves are plotted for different $m$, which is introduced to reduce the correlation in Algorithm 1. From both figures, we observe that a larger $m$ helps to reduce the correlation of the bits induced by BLSH-SIK. Even though BLSH-SIK cannot match the performance of LSH-SIK with a single projection, we argue that the performance is comparable. Moreover, the computational time and memory consumption for generating binary codes are significantly reduced, as explained in the next paragraph.

Fig. 7 shows the comparison between LSH-SIK with a single large projection and BLSH-SIK in terms of the computational time and memory consumption on the Flickr45K dataset.^5 In the case of BLSH-SIK, the time cost and memory consumption are reported with respect to different $m$, which evidently shows that the computational time and memory consumption of BLSH-SIK are much smaller than those of LSH-SIK with a single projection. From Figs. 6 and 7, we can conclude that $m = 5$ is a good choice for BLSH-SIK, because $m = 5$ performs well compared to $m = 10$ but is much faster than $m = 10$.

Figure 7. Comparison between LSH-SIK with a single large projection (referred to as Single) and BLSH-SIK (referred to as Bilinear) in terms of (a) the computational time and (b) the memory consumption on the Flickr45K dataset.

Fig. 8 shows the precision-recall curves for LSH-SIK with a single projection and BLSH-SIK on Flickr45K under the same computational time budget for generating a binary code. Therefore, fewer bits are used for LSH-SIK with a single projection than for BLSH-SIK. For both $m = 1$ and $m = 5$, BLSH-SIK is superior to LSH-SIK with a single projection at the same computational time.

Figure 8. Precision-recall curves for LSH-SIK with a single projection (referred to as Single) and BLSH-SIK (referred to as Bilinear) on Flickr45K when the same computational time for generating a binary code is allotted to LSH-SIK with a single projection and BLSH-SIK. The first (second) row shows the results when $m$ of BLSH-SIK is one (five). Panels in each row: 16,900 bits, 25,600 bits, 40,000 bits, and 62,500 bits.

^5 To measure the computational time, a single thread is used on an Intel i7 3.60GHz machine (64GB main memory). Fig. 7(a) does not include the computational time of LSH-SIK with a single projection for 62,500 bits due to the high memory consumption.

5. Conclusions 

In this paper we have presented a bilinear extension of LSH-SIK [11], referred to as BLSH-SIK, where we proved that the expected Hamming distance between the binary codes of two vectors is related to the value of the Gaussian kernel when the column vectors of the projection matrices are independently drawn from a spherical Gaussian distribution. Our theoretical analysis has confirmed that: (1) randomized bilinear projection yields similarity-preserving binary codes; (2) the performance of BLSH-SIK is comparable to LSH-SIK, showing that the correlation between two bits of binary codes computed by BLSH-SIK is small. Numerical experiments on the MNIST and Flickr45K datasets confirmed the validity of our method.

A. Bounds on the Expected Hamming Distance

Lemma 3. [11] For any $u, v \in [-1, 1]$, $\mathbb{P}_t\{\mathrm{sgn}(u + t) \ne \mathrm{sgn}(v + t)\} = |u - v|/2$.

Lemma 4. (Lemma 2 in the paper)

$$\mathbb{E}_{w,v,b,t}\left[\mathbb{1}[h(X) \ne h(Y)]\right] = \frac{8}{\pi^2}\sum_{m=1}^{\infty}\frac{1 - \kappa_b(mX - mY)}{4m^2 - 1},$$

where $h(X) = \frac{1}{2}\left(1 + \mathrm{sgn}(\cos(w^\top X v + b) + t)\right)$, $w, v \sim \mathcal{N}(0, I)$, $b \sim \mathrm{Unif}[0, 2\pi]$, and $t \sim \mathrm{Unif}[-1, 1]$.

Proof. Using Lemma 3, we can show that $\mathbb{E}_{w,v,b,t}\left[\mathbb{1}[h(X) \ne h(Y)]\right] = \frac{1}{2}\mathbb{E}_{w,v,b}\left|\cos(w^\top X v + b) - \cos(w^\top Y v + b)\right|$. By using a trigonometric identity,

$$\frac{1}{2}\mathbb{E}_{b}\mathbb{E}_{w,v}\left|\cos(w^\top X v + b) - \cos(w^\top Y v + b)\right| = \frac{2}{\pi}\mathbb{E}_{w,v}\left|\sin\left(\frac{w^\top (X - Y) v}{2}\right)\right|.$$

Following [11], we use the Fourier series of $g(\tau) = |\sin(\tau)|$:

$$g(\tau) = \frac{4}{\pi}\sum_{m=1}^{\infty}\frac{1 - \cos(2m\tau)}{4m^2 - 1}.$$

This formula leads to the following equation:

$$\mathbb{E}_{w,v,b,t}\left[\mathbb{1}[h(X) \ne h(Y)]\right] = \frac{8}{\pi^2}\sum_{m=1}^{\infty}\frac{1 - \mathbb{E}_{w,v}\left[\cos(m w^\top (X - Y) v)\right]}{4m^2 - 1}.$$

According to the proof of Lemma 1 described in the paper, we know that $\mathbb{E}_{w,v}\cos(m w^\top (X - Y) v) = \kappa_b(mX - mY)$, which completes the proof. Q.E.D.
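Two ingredients of this proof are easy to verify numerically: the quantizer identity of Lemma 3 and the Fourier series of $|\sin(\tau)|$. The sketch below uses only the Python standard library; the function names are hypothetical, and the series is truncated at a finite number of terms.

```python
import math
import random


def quantizer_disagreement(u, v, n_samples, rng):
    """Monte Carlo estimate of P_t{sgn(u + t) != sgn(v + t)} with t ~ Unif[-1, 1]."""
    count = 0
    for _ in range(n_samples):
        t = rng.uniform(-1.0, 1.0)
        count += (u + t >= 0.0) != (v + t >= 0.0)
    return count / n_samples


def abs_sin_series(tau, terms=10000):
    """Truncated series (4 / pi) * sum_m (1 - cos(2 m tau)) / (4 m^2 - 1) for |sin(tau)|."""
    partial = sum(
        (1.0 - math.cos(2.0 * m * tau)) / (4.0 * m * m - 1.0) for m in range(1, terms + 1)
    )
    return 4.0 / math.pi * partial


if __name__ == "__main__":
    rng = random.Random(0)
    u, v = 0.3, -0.5
    print("estimate:", quantizer_disagreement(u, v, 100_000, rng), "exact:", abs(u - v) / 2)
    for tau in (0.1, 1.0, 2.5):
        print(tau, abs_sin_series(tau), abs(math.sin(tau)))
```

The two bits disagree exactly when $t$ falls between $-u$ and $-v$, an interval of length $|u - v|$ inside $[-1, 1]$, which is where the factor $|u - v|/2$ comes from.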


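B. Bounds on the Covariance by Bilinear Projections

Lemma 5. Given a datum $X \in \mathbb{R}^{d_w \times d_v}$, if the bilinear projections $w, v_1, v_2$ are drawn from the standard Gaussian distribution, then

$$\mathbb{E}_{w,v_1,v_2}\left[\cos(m w^\top X v_1)\cos(n w^\top X v_2)\right] = \kappa_b\left(\sqrt{m^2 + n^2}\, X\right).$$

Proof.

$$\mathbb{E}_{w,v_1,v_2}\left[\cos(m w^\top X v_1)\cos(n w^\top X v_2)\right]$$

$$= \int \left[\int \cos(m w^\top X v_1)\, p(v_1)\, dv_1\right]\left[\int \cos(n w^\top X v_2)\, p(v_2)\, dv_2\right] p(w)\, dw$$

$$= \int \kappa_g(m X^\top w)\, \kappa_g(n X^\top w)\, p(w)\, dw \quad \text{(by Lemma 1 in the paper)}$$

$$= (2\pi)^{-\frac{d_w}{2}} \int \exp\left(-\frac{1}{2} w^\top \left[I + m^2 X X^\top + n^2 X X^\top\right] w\right) dw$$

$$= \left|I + (m^2 + n^2) X X^\top\right|^{-\frac{1}{2}} = \kappa_b\left(\sqrt{m^2 + n^2}\, X\right).$$

Theorem 4. (Theorem 3 in the paper) Given the hash functions $h_1(\cdot)$ and $h_2(\cdot)$, the upper bound on the covariance between the two bits is derived as

$$\mathrm{cov}(\cdot) \le \frac{64}{\pi^4}\left[\left(\sum_{m=1}^{\infty}\frac{\kappa_g(\mathrm{vec}(X - Y))^{0.79 m^2}}{4m^2 - 1}\right)^2 - \left(\sum_{m=1}^{\infty}\frac{\kappa_g(\mathrm{vec}(X - Y))^{m^2}}{4m^2 - 1}\right)^2\right],$$

where $\kappa_g(\cdot)$ is the Gaussian kernel and $\mathrm{cov}(\cdot)$ is the covariance between two bits defined as

$$\mathrm{cov}(\cdot) = \mathbb{E}_{w,v_1,v_2,b_1,b_2,t_1,t_2}\left[\mathbb{1}[h_1(X) \ne h_1(Y)]\,\mathbb{1}[h_2(X) \ne h_2(Y)]\right] - \mathbb{E}_{w,v_1,b_1,t_1}\left[\mathbb{1}[h_1(X) \ne h_1(Y)]\right]\mathbb{E}_{w,v_2,b_2,t_2}\left[\mathbb{1}[h_2(X) \ne h_2(Y)]\right],$$

$$h_1(X) = \mathrm{sgn}\left(\cos(w^\top X v_1 + b_1) + t_1\right),$$

$$h_2(X) = \mathrm{sgn}\left(\cos(w^\top X v_2 + b_2) + t_2\right).$$

Proof. First, we derive the first term in the covariance in terms of $\kappa_b(\cdot)$:

$$\mathbb{E}_{w,v_1,v_2,b_1,b_2,t_1,t_2}\left[\mathbb{1}[h_1(X) \ne h_1(Y)]\,\mathbb{1}[h_2(X) \ne h_2(Y)]\right]$$

$$= \frac{1}{4}\mathbb{E}_{w,v_1,v_2,b_1,b_2}\left[\left|\cos(w^\top X v_1 + b_1) - \cos(w^\top Y v_1 + b_1)\right| \left|\cos(w^\top X v_2 + b_2) - \cos(w^\top Y v_2 + b_2)\right|\right] \quad \text{(by Lemma 3)}$$

$$= \frac{4}{\pi^2}\mathbb{E}_{w,v_1,v_2}\left[\left|\sin\left(\frac{w^\top (X - Y) v_1}{2}\right)\sin\left(\frac{w^\top (X - Y) v_2}{2}\right)\right|\right]$$

$$= \frac{64}{\pi^4}\mathbb{E}_{w,v_1,v_2}\left[\sum_{m,n=1}^{\infty}\frac{1 - \cos(m w^\top (X - Y) v_1)}{4m^2 - 1}\cdot\frac{1 - \cos(n w^\top (X - Y) v_2)}{4n^2 - 1}\right].$$

Using Lemma 5, the first term in the covariance can be represented in terms of $\kappa_b(\cdot)$:

$$\mathbb{E}_{w,v_1,v_2,b_1,b_2,t_1,t_2}\left[\mathbb{1}[h_1(X) \ne h_1(Y)]\,\mathbb{1}[h_2(X) \ne h_2(Y)]\right]$$

$$= \frac{64}{\pi^4}\sum_{m,n=1}^{\infty}\frac{1}{4m^2 - 1}\cdot\frac{1}{4n^2 - 1}\left(1 - \kappa_b(mX - mY) - \kappa_b(nX - nY) + \kappa_b\left(\sqrt{m^2 + n^2}\,(X - Y)\right)\right).$$

The second term in the covariance is also represented in terms of $\kappa_b(\cdot)$:

$$\mathbb{E}_{w,v_1,b_1,t_1}\left[\mathbb{1}[h_1(X) \ne h_1(Y)]\right]\mathbb{E}_{w,v_2,b_2,t_2}\left[\mathbb{1}[h_2(X) \ne h_2(Y)]\right]$$

$$= \frac{64}{\pi^4}\sum_{m,n=1}^{\infty}\frac{1}{4m^2 - 1}\cdot\frac{1}{4n^2 - 1}\left(1 - \kappa_b(m(X - Y)) - \kappa_b(n(X - Y)) + \kappa_b(m(X - Y))\,\kappa_b(n(X - Y))\right).$$

Therefore, the covariance between two bits is computed as

$$\mathrm{cov}(\cdot) = \frac{64}{\pi^4}\sum_{m,n=1}^{\infty}\frac{1}{4m^2 - 1}\cdot\frac{1}{4n^2 - 1}\left[\kappa_b\left(\sqrt{m^2 + n^2}\,(X - Y)\right) - \kappa_b(m(X - Y))\,\kappa_b(n(X - Y))\right]$$

$$\le \frac{64}{\pi^4}\sum_{m,n=1}^{\infty}\frac{1}{4m^2 - 1}\cdot\frac{1}{4n^2 - 1}\left[\kappa_g\left(\mathrm{vec}\left(\sqrt{m^2 + n^2}\,(X - Y)\right)\right)^{0.79} - \kappa_g(\mathrm{vec}(mX - mY))\,\kappa_g(\mathrm{vec}(nX - nY))\right]$$

$$= \frac{64}{\pi^4}\sum_{m,n=1}^{\infty}\frac{1}{4m^2 - 1}\cdot\frac{1}{4n^2 - 1}\left[\kappa_g(\mathrm{vec}(X - Y))^{0.79(m^2 + n^2)} - \kappa_g(\mathrm{vec}(X - Y))^{m^2}\,\kappa_g(\mathrm{vec}(X - Y))^{n^2}\right]$$

$$= \frac{64}{\pi^4}\left[\left(\sum_{m=1}^{\infty}\frac{\kappa_g(\mathrm{vec}(X - Y))^{0.79 m^2}}{4m^2 - 1}\right)^2 - \left(\sum_{m=1}^{\infty}\frac{\kappa_g(\mathrm{vec}(X - Y))^{m^2}}{4m^2 - 1}\right)^2\right],$$

where the inequality is given by Lemma 1 in the paper ($\kappa_g(\mathrm{vec}(X - Y)) \le \kappa_b(X - Y) \le \kappa_g(\mathrm{vec}(X - Y))^{0.79}$) and the equality that follows it is given by $\kappa_g(\mathrm{vec}(mX - mY)) = \kappa_g(\mathrm{vec}(X - Y))^{m^2}$. The lower bound can be derived in a similar way.

Corollary 1. Given the hash functions $h_1(\cdot)$ and $h_2(\cdot)$, the lower bound on the covariance between the two bits is derived as

$$\mathrm{cov}(\cdot) \ge \frac{64}{\pi^4}\left[\left(\sum_{m=1}^{\infty}\frac{\kappa_g(\mathrm{vec}(X - Y))^{m^2}}{4m^2 - 1}\right)^2 - \left(\sum_{m=1}^{\infty}\frac{\kappa_g(\mathrm{vec}(X - Y))^{0.79 m^2}}{4m^2 - 1}\right)^2\right],$$

where $\kappa_g(\cdot)$ is the Gaussian kernel and $\mathrm{cov}(\cdot)$ is the covariance between two bits defined as

$$\mathrm{cov}(\cdot) = \mathbb{E}_{w,v_1,v_2,b_1,b_2,t_1,t_2}\left[\mathbb{1}[h_1(X) \ne h_1(Y)]\,\mathbb{1}[h_2(X) \ne h_2(Y)]\right] - \mathbb{E}_{w,v_1,b_1,t_1}\left[\mathbb{1}[h_1(X) \ne h_1(Y)]\right]\mathbb{E}_{w,v_2,b_2,t_2}\left[\mathbb{1}[h_2(X) \ne h_2(Y)]\right].$$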
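The covariance bounds depend on the data only through the Gaussian kernel value, so they can be tabulated directly. The following sketch evaluates truncated versions of the two series; the function names are hypothetical, and the lower bound is taken to be the negative of the upper bound, which is what the two expressions above reduce to.

```python
import math


def _weighted_sum(base, exponent_scale, terms=2000):
    """Truncated sum over m >= 1 of base^(exponent_scale * m^2) / (4 m^2 - 1)."""
    return sum(
        base ** (exponent_scale * m * m) / (4.0 * m * m - 1.0) for m in range(1, terms + 1)
    )


def covariance_upper_bound(kappa_g):
    """(64 / pi^4) * [(sum with exponent 0.79 m^2)^2 - (sum with exponent m^2)^2]."""
    c = 64.0 / math.pi ** 4
    return c * (_weighted_sum(kappa_g, 0.79) ** 2 - _weighted_sum(kappa_g, 1.0) ** 2)


def covariance_lower_bound(kappa_g):
    """Lower bound; the two squared sums simply swap roles."""
    return -covariance_upper_bound(kappa_g)


if __name__ == "__main__":
    for kappa_g in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
        print(kappa_g, covariance_upper_bound(kappa_g))
```

Both bounds vanish at kernel values 0 and 1 and are largest in between, which matches the observation that the unfavorable correlation concerns points that are neither highly similar nor highly dissimilar.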





















